Data Science with R
DAY 1
A language is a method or system of communication. Communication between humans doesn’t have to be precise: we can talk in unfinished sentences or with grammatically incorrect language and still be understood. The same does not hold when we communicate with computers through a language they understand. R is a language for communicating with computers, with the main purpose of having them perform data analysis. R knows how to convey our commands to the computer, but we must give our commands to R in a very specific way. In this course we will acquire vocabulary and learn ways to communicate with R.
There are other languages and tools for talking to computers for the purpose of data analysis (e.g. Excel, SAS, Stata). The reason we promote R is that we talk to the computer through code: plain text written in a text file, which can be saved and used by anyone else to produce exactly the same results we did. This feature is called reproducibility. Also, R, beyond being a computer language, is a flexible system that allows the entire data analysis process, from data collection to reporting, to be carried out without leaving the environment provided by RStudio. This makes analysis less error-prone and facilitates external scrutiny. Another good feature of R is that it can read data prepared for other software and transform R output to feed into other systems. R can also talk directly to other systems (for example, R can query a SQL database before importing the result).
We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science. R is not just a programming language, but it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer. — From “R for Data Science” by H. Wickham and G. Grolemund
The reproducibility feature of R is very important. Whilst reproducibility doesn’t guarantee that an analysis or its conclusions are correct, it guarantees that, given the same data or inputs, it can be performed in the exact same fashion by other analysts, e.g. for verification. The complexity of contemporary data analysis requires reproducibility which, in the end, is all an analyst can guarantee.
Nowadays, code is seen as more than just syntax to communicate with computers. Code is viewed as something more abstract, as “the notation of data science”, just as what mathematical notation is to mathematics or physics.
This course focuses on expanding the knowledge of the R language by learning ways to feed data in different formats in R and how to prepare such data so that it is ready for analysis. We will do this mostly through a series of case studies using real data, each presenting different challenges. This course also will introduce you to simple data visualisation, useful to inspect and prepare a data set and also plan the analysis stage.
“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
Data analysis can be summarised as a process which must contain the following 5 stages:
Reflection on problem and resources.
Data collection.
Data preparation.
Data analysis.
Reporting of results.
The data analysis process is not a strict, unidirectional 5-step sequence. The stages appear in the order above, but at any point of the analysis it might be necessary to return to a previous stage and restart from there, depending on the results obtained and whether or not they are useful for the investigation. For example, once the analysis stage has been reached, we may realise that we need more information (additional data), and so we go back to the very start, reflecting and preparing more data before continuing with the analysis. Or perhaps, at any stage, we realise that the data collected is not suitable to answer the question that motivated the analysis.
We explain in what follows what each stage involves.
Reflection
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
Before any contact with digital resources, the analyst should reflect on the problem at hand. What are the questions that need to be answered? What resources are needed to answer such questions? What resources are available? If the data is not yet available, then it is crucial to carefully plan how information will be gathered so that questions of interest can be answered.
If data is already available, if possible, have a look at the raw data set, or at least parts of it. Envisage possible issues or challenges and a possible action plan. For example, the data is presented in several files/sheets that will need to be unified. Or, the data comes in a single file which might need to be split into several related files or it has non-informative data which will need to be trimmed. Perhaps unusual values or many empty cells are observed.
Keep on returning to this stage as needed throughout the data analysis process.
Data collection
Most likely, the analyst does not physically collect the data; rather, it is handed to them by a third party. The data can be transferred to the analyst in many different ways, and the challenge is then to transfer the data into the chosen system for analysis, in our case R.
Data preparation
Once the data has been imported into the system, the issue is whether the format of the data is adequate for the system to carry out analyses. Data can be presented to the analyst organised in a myriad of equivalent ways. But there is at least one correct format for each system. We will discuss the notion of “correct format” or tidy data in what follows and learn how to organise data in R so that it has tidy features. Tidy data makes communication with R easier, helps towards reproducibility objectives and produces a less error-prone analysis. There are other common issues with data, such as errors (mistyped entries), duplication of values, missing values, abnormal values, outliers, etc. We will explore ways to uncover these issues in data and how to deal with them.
The data preparation step is quite a complex one that requires a great deal of attention, effort and communication with the data originator. In the New York Times article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights” (Aug 17, 2014) we read:
“Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. . . . Data experts try to automate as many steps in the process as possible. “But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place””
Despite the bad PR for data preparation, the task is crucial if meaningful results are to be obtained from the analysis process. Messy and unprepared data sets used for analyses can give meaningless results. Remember: rubbish in - rubbish out.
Besides, one of the goals is reproducibility and streamlined analyses, with results that anyone can replicate and verify. Adopting common standards for what constitutes a tidy data set is essential for both correctness and reproducibility.
Data analysis
Once our data is readable in the chosen system, has been prepared in the required format and the best effort to clean it has been made, it can be passed on to routines that will analyse the data and quantify the results using adequate paradigms and procedures.
Reporting
The entire process above is reported for checking and dissemination. The latest trend in reporting is along the lines of reproducible research or analysis. It uses interactive digital platforms which can communicate with the data and software without leaving the system, from data collection to reporting. In this way, anyone can easily check and reproduce results. The preferred digital platform for R users is R Markdown which produces interactive reports, pdf documents, slides, etc. R Markdown also supports other programming languages. Some of the available language engines include:
Python
SQL
Bash
Rcpp
Stan
JavaScript
CSS
REMEMBER: The process Reflection \(\rightarrow\) Collection \(\rightarrow\) Preparation \(\rightarrow\) Analysis \(\rightarrow\) Reporting is not unidirectional. At any stage, it might be necessary to return to the Reflection or Collection step, and there might be several loops between Preparation and Analysis. Also, don’t worry if you spend a large proportion of your time preparing, tidying and understanding your data. It is expected you will.
R course
In this course we will focus mainly on the first three steps of the data analysis process using the R software, preferably in the context of RStudio.
In the first step, we will see ways of importing data in several digital formats into R (Excel, SAS, Stata, SPSS, .csv, .tsv, web data, SQL, etc.).
No fixed protocol for data preparation will take care of everything that needs to be manipulated in a data set to make it ready for analysis. Nevertheless, we will define minimum qualities that a clean or tidy data set must have and the most common features shown by “dirty” or messy data. We will show how to fix issues via manipulation of the data arrays and ways to detect common abnormalities via data summaries and simple data visualisation.
R packages for data science
R is open source software, free for anyone to use and develop. R has a basic vocabulary which allows for most operations to be carried out. Functions, the actions that R can carry out, are organised in “packages”. One of the packages containing many of the basic functions that R performs is the package utils, which loads automatically each time R is invoked. Being open source, individuals or groups develop packages, which are groups of functions based on R and perhaps other languages, that facilitate certain actions or data analysis tasks. Packages need to be installed in R, just once, and invoked in each new session where you need the package.
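The install-once, load-per-session workflow can be sketched as follows (dplyr here is just an example package name; install.packages() is commented out because it only needs to run once per machine):

```r
# install a package once per machine (run only the first time; commented out here):
# install.packages("dplyr")

# load a package at the start of each session where you need it:
library(utils)   # utils ships with R and is in fact attached automatically

# once attached, the package's functions are available by name:
head(letters, 3)
```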
The Tidyverse is a collection of R packages specially developed for data science. In the Tidyverse package authors’ own words:
“The Tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
We will mostly use packages from the Tidyverse and from the basic R distribution. The packages in the Tidyverse make data manipulation and other actions easy and logical. One-line functions in Tidyverse packages save us from having to type many lines of complex code. So, not only does the Tidyverse save the data scientist time, but it also makes the analysis less error-prone.
The packages from the Tidyverse we will use in this course are: readr, readxl and haven (importing data), tidyr, dplyr, lubridate and stringr (tidying and manipulating data sets). For visualisation we will use mostly ggplot2.
Suggested literature (click on the image for a free, online version)
Also, there are useful “cheatsheets” with brief summaries of the functions available in packages and their basic use. Click on a cheatsheet to access a PDF.
Relevant for this course is the Data Wrangling cheatsheet.
#make required libraries available
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(readxl)
library(ggplot2)
library(haven)
library(lubridate)
library(purrr)
library(vcd)
You should have downloaded a folder from GitHub containing some files and a sub-folder called “DataFiles”. Note where the downloaded folder is and name it something meaningful, e.g. “DSWR”. Go to RStudio and from the menu at the top of the screen choose “Session”, “Set Working Directory”, “Choose directory”, and then choose the folder DSWR. DSWR is now your working directory and the folder DataFiles should be within it.
RStudio
RStudio is an integrated development environment (IDE) for R. RStudio not only contains a console (the “smart” part) but also an editor for R scripts, R Notebooks and R Markdown documents. You can control and manage your workspace. There are many other features. See https://www.rstudio.com/products/rstudio/.
We will focus on using RStudio to produce R Notebooks, a particular type of R Markdown document. In an R Notebook we intercalate text and code. The bits with code are called “chunks”. The chunks can be executed independently and interactively, with output visible immediately beneath the input.
Notebook chunks are inserted using the keyboard shortcut Ctrl + Alt + I (macOS: Cmd + Option + I), or via the Insert menu in the editor toolbar. A chunk can also be inserted manually by typing ```{r}, writing code on the following lines, and closing the chunk with ``` on a separate line. Below is an example of two chunks with their outputs.
Note how you can write comments inside a chunk. Comments are always preceded by #.
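For instance, the body of a chunk might look like this (a made-up snippet, purely to illustrate comments):

```r
# everything from the # to the end of the line is a comment; R ignores it
ages <- c(41.2, 42.9, 38.5)   # create a numeric vector of three ages
mean(ages)                    # compute the average age
```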
Any R Markdown document can be used as a notebook, and all R Notebooks can be rendered to other R Markdown document types. A notebook can therefore be thought of as a special execution mode for R Markdown documents. The immediacy of notebook mode makes it a good choice while authoring the R Markdown document and iterating on code. When you are ready to publish the document, you can share the notebook directly, or render it to a publication format with the Knit button.
See https://bookdown.org/yihui/rmarkdown/notebook.html
An R Markdown notebook has the following characteristics:
Reproducible workflow by default.
Code, output and narrative in a single document.
Output to multiple formats (html, PDF, Word).
Output inline with code: When you execute code within the notebook, the results appear beneath the code.
It is possible to share output and code in a single file.
The notebook source code is a .Rmd file.
Notebooks have an associated HTML file (.nb.html). It is created automatically when the .Rmd is saved. When you save a notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
When a .nb.html file is opened in RStudio it will automatically open a .Rmd file with the source code.
Go through the file SampleRNotebook.Rmd and understand the structure of the file, sections, itemized lists, R chunks, etc.
Data may come in many formats, from simple text files, through Excel files or workbooks, to data prepared for other statistical software such as SAS, Stata, etc. Or data may need to be collected from the web. R is versatile software that allows data in almost any format to be read and analysed.
The most common type of file is a “flat file”: a text file with a two-dimensional array of entries or cells. The entries can be separated by a tab, a blank space or a comma.
We have a few options (packages) here. When you load R, a package named utils loads automatically with a set of basic R functions. The package utils contains some useful functions to import flat files into R.
We will use the package readr from the Tidyverse collection as it is more suited for consistency with the approach to data science in this course. There is less to type when using functions from readr and less typing means less chances of error.
Most of readr’s functions are concerned with turning flat files into data frames:
The functions read_table() and read_table2() are designed to read the type of textual data where each column is separated by one (or more) columns of space.
read_table2() allows any number of white space characters between columns, and the lines can be of different lengths.
read_table() is more strict, each line must be the same length, and each field is in the same position in every line. It first finds empty columns and then parses like a fixed width file.
read_csv() reads comma delimited files, read_csv2() reads semicolon separated files (common in countries where , is used as the separator of integer and decimal), read_tsv() reads tab delimited files, and read_delim() reads in files with any delimiter.
read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions(). read_table() reads a common variation of fixed width files where columns are separated by white space.
read_log() reads Apache style log files. (But also check out webreadr which is built on top of read_log() and provides many more helpful tools.)
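A quick way to experiment: readr functions also accept literal text (a string containing newlines) in place of a file name. A minimal sketch with a toy pipe-delimited data set:

```r
library(readr)

# literal text stands in for a file; "|" is the delimiter
df <- read_delim("name|score\nana|10\nben|7", delim = "|")
df
```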
| utils | readr | use |
|---|---|---|
| read.table() | read_delim() | read any flat file with any delimiter |
| read.csv() | read_csv() | read comma separated values |
| read.delim() | read_tsv() | read tab delimited files |
The most relevant functions for importing data in the package utils are: read.table(), read.csv() and read.delim(). The latter two are special cases of read.table(). Similarly, in the package readr the general function is read_delim(), of which read_csv() and read_tsv() are special cases.
Compared to the corresponding base functions, readr functions:
use a consistent naming scheme for the parameters (e.g. col_names and col_types not header and colClasses);
are much faster (up to 10x);
leave strings as is by default, and automatically parse common date/time formats;
have a helpful progress bar if loading is going to take a while;
all functions work exactly the same way regardless of the current locale. To override the US-centric defaults, use locale().
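For instance, a semicolon-delimited file with a comma as the decimal mark (common in continental Europe) can be parsed by overriding the locale. A small sketch on literal text:

```r
library(readr)

df <- read_delim("city;temp\nParis;21,5\nOslo;9,0", delim = ";",
                 locale = locale(decimal_mark = ","))
df$temp   # parsed as the numbers 21.5 and 9.0
```

The convenience function read_csv2() assumes this combination of delimiter and decimal mark by default.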
There is a package called data.table with a function similar to read_csv() called fread(). According to the readr developers, compared to fread(), readr functions:
are slower (currently about 1.2 to 2 times slower). If you want the absolute best performance, especially when dealing with a very large flat file, use data.table::fread() (this notation means the function fread() from the package data.table);
use a slightly more sophisticated parser, recognising both doubled and backslash escapes, and can produce factors and date/times directly;
force you to supply all parameters, whereas fread() saves you work by automatically guessing the delimiter, whether or not the file has a header, and how many lines to skip;
are built on a different underlying infrastructure. readr functions are designed to be quite general, which makes it easier to add support for new rectangular data formats. fread() is designed to be as fast as possible.
When working with very large data sets (10 GB to 100 GB) it is recommended to use the package data.table.
EXAMPLE. The file “EXER_age_sex_race.csv”, in your working directory, contains demographic information about people who took part in a study to assess the health benefits of an exercise plan (to be revisited).
data1 <- read_csv("DataFiles/EXER_age_sex_race.csv")
## Parsed with column specification:
## cols(
## subject_ID = col_integer(),
## SexAge_Race = col_character()
## )
The data has been imported into an object called data1. Note that R responded not only by creating the data frame data1 containing the information in the file “EXER_age_sex_race.csv”, but also telling us what type of columns data1 has. data1 has two columns: subject_ID, a column of integers, and SexAge_Race, which is a column of character strings (each element of this column is a mixture of numbers, symbols and letters; R interprets these strings as a sequence of characters, with no numerical value). This is a useful feature of read_csv(). The first thing we want to check after importing data is how columns were read.
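When the guessed specification is not what we want, the column types can be stated explicitly through the col_types argument. A sketch, with literal text standing in for the file:

```r
library(readr)

# state each column's type explicitly instead of relying on guessing
df <- read_csv("subject_ID,SexAge_Race\n1,MALE41.2_White",
               col_types = cols(subject_ID = col_integer(),
                                SexAge_Race = col_character()))
df
```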
To inspect what type of object data1 is, we use the function class()
class(data1)
## [1] "tbl_df" "tbl" "data.frame"
We can see the top 6 rows of data1,
head(data1)
## # A tibble: 6 x 2
## subject_ID SexAge_Race
## <int> <chr>
## 1 1 MALE41.2_White
## 2 2 FEMALE42.9_White
## 3 3 FEMALE38.5_White
## 4 4 FEMALE35.6_Hispanic
## 5 5 FEMALE48.5_White
## 6 6 FEMALE36.9_NA
and the last six rows of data1
tail(data1)
## # A tibble: 6 x 2
## subject_ID SexAge_Race
## <int> <chr>
## 1 4995 FEMALE41.8_Hispanic
## 2 4996 FEMALE34.5_Black
## 3 4997 FEMALE40.5_NA
## 4 4998 FEMALE44.1_Black
## 5 4999 FEMALE46.4_Black
## 6 5000 FEMALE49.9_Black
We can see that there are 5000 subjects in the study and all the information about each subject is in one column. So, we will need to separate this information. Also, we can see that there are missing values for race (people can opt out from disclosing this type of information).
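As a preview of the preparation stage, the tidyr function separate() can split such a combined column. A minimal sketch on two made-up rows:

```r
library(tibble)
library(tidyr)

demo <- tibble(subject_ID = 1:2,
               SexAge_Race = c("MALE41.2_White", "FEMALE35.6_Hispanic"))

# split SexAge_Race at the underscore into two columns
separate(demo, SexAge_Race, into = c("SexAge", "Race"), sep = "_")
```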
The object data1, besides being a data frame, is a “tibble”. Tibbles have certain qualities that make data frames more suitable for use in the Tidyverse.
Note:
class(read.csv("DataFiles/EXER_age_sex_race.csv"))
## [1] "data.frame"
class(read_csv("DataFiles/EXER_age_sex_race.csv"))
## Parsed with column specification:
## cols(
## subject_ID = col_integer(),
## SexAge_Race = col_character()
## )
## [1] "tbl_df" "tbl" "data.frame"
We use the package readxl from the Tidyverse.
The function we use is read_excel(). Another useful function is excel_sheets() which lists different sheets in an excel workbook.
EXAMPLE. Importing an excel file sourced from Gapminder
Let us consider geographical information about countries of the world from https://www.gapminder.org/data/geo/. The Excel file “DataGeographiesGapminder.xlsx”, in your working directory, is a workbook with many sheets. The second sheet contains the list of country names, different region denominations and other geographical information.
excel_sheets("DataFiles/DataGeographiesGapminder.xlsx")#names of sheets in workbook
## [1] "ABOUT" "List of countries" "List of regions"
## [4] "World"
The Excel file “DataGeographiesGapminder.xlsx” has four sheets. If we would like to retrieve just the second sheet, we specify that as an argument, sheet = 2, of the read_excel() function:
continent <- read_excel("DataFiles/DataGeographiesGapminder.xlsx", sheet = 2)# import only the second sheet
head(continent)
## # A tibble: 6 x 11
## geo name four_regions eight_regions six_regions members_oecd_g77
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 afg Afgh… asia asia_west south_asia g77
## 2 alb Alba… europe europe_east europe_cen… others
## 3 dza Alge… africa africa_north middle_eas… g77
## 4 and Ando… europe europe_west europe_cen… others
## 5 ago Ango… africa africa_sub_s… sub_sahara… g77
## 6 atg Anti… americas america_north america g77
## # ... with 5 more variables: Latitude <dbl>, Longitude <dbl>, `UN member
## # since` <dttm>, `World bank region` <chr>, `World bank income group
## # 2017` <chr>
If we want to retrieve all the sheets we can use the function lapply() in base R.
sheetnames <- excel_sheets("DataFiles/DataGeographiesGapminder.xlsx")
# this is a list of four data frames (tibbles)
sheet_list <- lapply(sheetnames, function(x){read_excel("DataFiles/DataGeographiesGapminder.xlsx", sheet = x)})
# first six rows of the second tibble or data frame
head(sheet_list[[2]])
## # A tibble: 6 x 11
## geo name four_regions eight_regions six_regions members_oecd_g77
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 afg Afgh… asia asia_west south_asia g77
## 2 alb Alba… europe europe_east europe_cen… others
## 3 dza Alge… africa africa_north middle_eas… g77
## 4 and Ando… europe europe_west europe_cen… others
## 5 ago Ango… africa africa_sub_s… sub_sahara… g77
## 6 atg Anti… americas america_north america g77
## # ... with 5 more variables: Latitude <dbl>, Longitude <dbl>, `UN member
## # since` <dttm>, `World bank region` <chr>, `World bank income group
## # 2017` <chr>
Or we can use the function map() from the package purrr, part of the Tidyverse.
The function map() applies a function to each of the elements of a vector and outputs a list with as many entries as the length of the vector.
sheetnames <- excel_sheets("DataFiles/DataGeographiesGapminder.xlsx")
table_list <- map(sheetnames, ~read_excel("DataFiles/DataGeographiesGapminder.xlsx", sheet = .x))
head(table_list[[2]])
## # A tibble: 6 x 11
## geo name four_regions eight_regions six_regions members_oecd_g77
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 afg Afgh… asia asia_west south_asia g77
## 2 alb Alba… europe europe_east europe_cen… others
## 3 dza Alge… africa africa_north middle_eas… g77
## 4 and Ando… europe europe_west europe_cen… others
## 5 ago Ango… africa africa_sub_s… sub_sahara… g77
## 6 atg Anti… americas america_north america g77
## # ... with 5 more variables: Latitude <dbl>, Longitude <dbl>, `UN member
## # since` <dttm>, `World bank region` <chr>, `World bank income group
## # 2017` <chr>
Using the %>% operator
library(purrr)
sheetnames <- excel_sheets("DataFiles/DataGeographiesGapminder.xlsx")
table_list <- sheetnames %>% map(~read_excel("DataFiles/DataGeographiesGapminder.xlsx", sheet = .x))
head(table_list[[2]])
## # A tibble: 6 x 11
## geo name four_regions eight_regions six_regions members_oecd_g77
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 afg Afgh… asia asia_west south_asia g77
## 2 alb Alba… europe europe_east europe_cen… others
## 3 dza Alge… africa africa_north middle_eas… g77
## 4 and Ando… europe europe_west europe_cen… others
## 5 ago Ango… africa africa_sub_s… sub_sahara… g77
## 6 atg Anti… americas america_north america g77
## # ... with 5 more variables: Latitude <dbl>, Longitude <dbl>, `UN member
## # since` <dttm>, `World bank region` <chr>, `World bank income group
## # 2017` <chr>
Other packages that can be used to read Excel files are gdata and XLConnect (which can also create workbooks and sheets to feed into Excel).
The package haven, in the Tidyverse, provides easy-to-use functions to import data from SAS, Stata and SPSS.
| Software | Function |
|---|---|
| SAS | read_sas() |
| Stata | read_dta() or read_stata() (identical functions) |
| SPSS | read_sav() or read_por() (depending on the file type to be imported) |
The functions to import data in the haven package take one essential argument: the file name with its path or location in your system or a URL.
The foreign package (which is not in the Tidyverse) can also be used to import data from other statistical software.
We use the function read_sas() in the package haven of the Tidyverse. We will import the file “tax.sas7bdat”, which contains information about the income and tax paid by 30 US firms in 1988 and 1989.
tax <- read_sas("DataFiles/tax.sas7bdat")
glimpse(tax)
## Observations: 30
## Variables: 4
## $ INC88 <dbl> 9.215, 2.047, 9.989, 8.321, 4.588, 4.736, 3.596, 4.830, ...
## $ TAX88 <dbl> 1.643, 0.413, 1.752, 1.408, 0.838, 0.748, 0.577, 0.752, ...
## $ INC89 <dbl> 9.518, 2.068, 9.992, 8.515, 4.389, 5.015, 3.811, 4.939, ...
## $ TAX89 <dbl> 2.125, 0.565, 2.221, 1.905, 0.943, 1.051, 0.819, 1.015, ...
head(tax)
## # A tibble: 6 x 4
## INC88 TAX88 INC89 TAX89
## <dbl> <dbl> <dbl> <dbl>
## 1 9.22 1.64 9.52 2.12
## 2 2.05 0.413 2.07 0.565
## 3 9.99 1.75 9.99 2.22
## 4 8.32 1.41 8.52 1.90
## 5 4.59 0.838 4.39 0.943
## 6 4.74 0.748 5.01 1.05
tail(tax)
## # A tibble: 6 x 4
## INC88 TAX88 INC89 TAX89
## <dbl> <dbl> <dbl> <dbl>
## 1 3.87 0.580 3.93 0.515
## 2 7.26 1.14 7.64 1.72
## 3 2.13 0.414 2.17 0.433
## 4 7.53 1.33 7.86 1.46
## 5 9.58 1.66 10.00 2.17
## 6 2.02 0.351 2.26 0.447
One can also use the package foreign, but only SAS transport libraries (.xport) can be imported; it cannot read individual .sas7bdat files.
We use the function read_sav() (for .sav files) or read_por() (for .por files), both in the package haven.
EXAMPLE. We import the file “sleep.sav”, in your working directory. The data has been downloaded from http://spss.allenandunwin.com.s3-website-ap-southeast-2.amazonaws.com/data-files.html#.W4rHuy2ZM_U
The data concerns the prevalence and impact of sleep problems on various aspects of people’s lives. Measured variables are sleep behaviour (e.g. hours slept per night), sleep problems (e.g. difficulty getting to sleep) and the impact that these problems have on aspects of their lives (work, driving, relationships). The sample consists of 271 individuals.
sleep <- read_sav("DataFiles/sleep.sav")
head(sleep)
## # A tibble: 6 x 55
## id sex age marital edlevel weight height healthrate fitrate
## <dbl> <dbl> <dbl> <dbl+l> <dbl+l> <dbl> <dbl> <dbl+lbl> <dbl+l>
## 1 83 0 42 2 2 52 162 10 7
## 2 294 0 54 2 5 65 174 " 8" 7
## 3 425 1 NA 2 2 89 170 " 6" 5
## 4 64 0 41 2 5 66 178 " 9" 7
## 5 536 0 39 2 5 62 160 " 9" 5
## 6 57 0 66 2 4 62 165 " 8" 8
## # ... with 46 more variables: weightrate <dbl+lbl>, smoke <dbl+lbl>,
## # smokenum <dbl>, alchohol <dbl>, caffeine <dbl>, hourwnit <dbl>,
## # hourwend <dbl>, hourneed <dbl>, trubslep <dbl+lbl>,
## # trubstay <dbl+lbl>, wakenite <dbl+lbl>, niteshft <dbl+lbl>,
## # liteslp <dbl+lbl>, refreshd <dbl+lbl>, satsleep <dbl+lbl>,
## # qualslp <dbl+lbl>, stressmo <dbl+lbl>, medhelp <dbl+lbl>,
## # problem <dbl+lbl>, impact1 <dbl+lbl>, impact2 <dbl+lbl>,
## # impact3 <dbl+lbl>, impact4 <dbl+lbl>, impact5 <dbl+lbl>,
## # impact6 <dbl+lbl>, impact7 <dbl+lbl>, stopb <dbl+lbl>,
## # restlss <dbl+lbl>, drvsleep <dbl+lbl>, drvresul <dbl+lbl>, ess <dbl>,
## # anxiety <dbl>, depress <dbl>, fatigue <dbl>, lethargy <dbl>,
## # tired <dbl>, sleepy <dbl>, energy <dbl>, stayslprec <dbl+lbl>,
## # getsleprec <dbl+lbl>, qualsleeprec <dbl+lbl>, totsas <dbl>,
## # cigsgp3 <dbl+lbl>, agegp3 <dbl+lbl>, probsleeprec <dbl+lbl>,
## # drvslprec <dbl+lbl>
print_labels(sleep$marital)#the factor columns have numeric entries, and each number has a name or label. Use this function to get the labels of a factor
##
## Labels:
## value label
## 1 single
## 2 married/defacto
## 3 divorced
## 4 widowed
We see that some factors, i.e. variables which are discrete and divide the observations into groups or levels, are labelled. In this case, factors indicate a group with a number but are not numerical variables. Those numbers, indicating groups, are labelled, i.e. have a name attached to them. For example, the variable marital is a factor with four levels: level 1 corresponds to single people, level 2 to married people, level 3 to divorced and level 4 to widowed people.
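If we want proper R factors rather than labelled numbers, haven provides as_factor(), which converts a labelled vector using its labels. A small sketch on a hand-built labelled vector (so the example is self-contained):

```r
library(haven)

# build a labelled vector like the marital column imported above
marital <- labelled(c(1, 2, 2, 4),
                    labels = c(single = 1, `married/defacto` = 2,
                               divorced = 3, widowed = 4))
as_factor(marital)   # a factor with levels single, married/defacto, divorced, widowed
```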
EXAMPLE. The file “trade.dta” contains information on yearly import and export numbers of sugar in the US, both in USD and in pounds (weight).
sugar <- read_dta("DataFiles/trade.dta")
glimpse(sugar)
## Observations: 10
## Variables: 5
## $ Date <dbl+lbl> 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
## $ Import <dbl> 37664782, 16316512, 11082246, 35677943, 9879878, 1539...
## $ Weight_I <dbl> 54029106, 21584365, 14526089, 55034932, 14806865, 174...
## $ Export <dbl> 54505513, 102700010, 37935000, 48515008, 71486545, 12...
## $ Weight_E <dbl> 93350013, 158000010, 88000000, 112000005, 131800000, ...
print_labels(sugar$Date)
##
## Labels:
## value label
## 1 2004-12-31
## 2 2005-12-31
## 3 2006-12-31
## 4 2007-12-31
## 5 2008-12-31
## 6 2009-12-31
## 7 2010-12-31
## 8 2011-12-31
## 9 2012-12-31
## 10 2013-12-31
The values in the column Date are labelled and the labels are dates. The other four columns contain numerical values.
Data from the web can be read directly into R if it is a text file (comma or tab separated) or a SAS or SPSS file. Instead of a file name, we specify a URL. The process is straightforward. See the example below, where a comma-separated-values file is imported from a website.
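For instance (the URL here is hypothetical, shown only to illustrate the call):

```r
library(readr)

# a hypothetical address of a comma-separated file on the web
url <- "https://example.com/data/mydata.csv"
web_data <- read_csv(url)   # read_csv() accepts a URL just like a local file path
```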
Excel files cannot be retrieved directly from the web. The file first needs to be downloaded, using the function download.file(). The process is illustrated in the example below.
EXAMPLE. Excel file from Gapminder
We will import data about Adults with HIV (estimated prevalence of HIV in percentage, ages 15-49) from Gapminder https://docs.google.com/spreadsheet/pub?key=pyj6tScZqmEfbZyl0qjbiRQ&output=xlsx
The function read_excel() cannot download excel files directly from the web.
We use the function download.file() to download the file into a directory and then we use read_excel() to read it into R.
url <- "https://docs.google.com/spreadsheet/pub?key=pyj6tScZqmEfbZyl0qjbiRQ&output=xlsx"
download.file(url, "HIV.xlsx")
HIV <- read_excel("HIV.xlsx")
head(HIV)
## # A tibble: 6 x 34
## `Estimated HIV … `1979.0` `1980.0` `1981.0` `1982.0` `1983.0` `1984.0`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhazia NA NA NA NA NA NA
## 2 Afghanistan NA NA NA NA NA NA
## 3 Akrotiri and Dh… NA NA NA NA NA NA
## 4 Albania NA NA NA NA NA NA
## 5 Algeria NA NA NA NA NA NA
## 6 American Samoa NA NA NA NA NA NA
## # ... with 27 more variables: `1985.0` <dbl>, `1986.0` <dbl>,
## # `1987.0` <dbl>, `1988.0` <lgl>, `1989.0` <lgl>, `1990.0` <dbl>,
## # `1991.0` <dbl>, `1992.0` <dbl>, `1993.0` <dbl>, `1994.0` <dbl>,
## # `1995.0` <dbl>, `1996.0` <dbl>, `1997.0` <dbl>, `1998.0` <dbl>,
## # `1999.0` <dbl>, `2000.0` <dbl>, `2001.0` <dbl>, `2002.0` <dbl>,
## # `2003.0` <dbl>, `2004.0` <dbl>, `2005.0` <dbl>, `2006.0` <dbl>,
## # `2007.0` <dbl>, `2008.0` <dbl>, `2009` <chr>, `2010` <chr>,
## # `2011` <chr>
We will revisit this data set in Case Study 6.
The packages RMySQL, RPostgresSQL, ROracle can be used to import data from relational databases.
To use the RMySQL package (Database Interface and ‘MySQL’ Driver for R) we first install the package in the usual way, install.packages("RMySQL"), and then load the DBI package using library(DBI). DBI stands for database interface.
We need access to a MySQL database. Suppose that database is called “DemoData” and that the host is called “Demo.Host.com”.
Firstly, connect to the database:
con <- dbConnect(RMySQL::MySQL(), dbname = "DemoData", host = "Demo.Host.com",
port = 3306, user = "myself", password = "itsasecret")
The object con is called a DBI connection object. RMySQL::MySQL() constructs a MySQL driver. The values between quotation marks are placeholders: you must enter the values relevant to your own database.
To list the table names in the database (the output is a character vector with the table names):
dbListTables(con)
Suppose that DemoData has a table called “DemoTable”. In order to read it:
dbReadTable(con, "DemoTable")
When you are finished, you should disconnect from the database with:
dbDisconnect(con)
To query the database before reading, for example,
dbGetQuery(con, "SELECT .... FROM ..... WHERE ....")
Happy families are all alike; every unhappy family is unhappy in its own way. Leo Tolstoy
It is impossible to predict how data may be presented to the analyst. Nearly every data set will have its own features, peculiarities and errors, and we must be ready to work with data in any shape. The task is then to transform the data so that it is in a standardised format, ready to be analysed using a system of choice, in our case R.
The quote above applies to data sets as: Tidy data sets are all alike; every messy data set is messy in its own way.
Hadley Wickham, in his paper Tidy Data, defines the three main qualities of tidy data which standardise the process of dealing with any data set:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

Qualities 1 and 2 are crucial for the adequate and easy use of R packages for visualisation and analysis. Quality 3 is important in order to organise the data, but it might be necessary to unify tables with related information at some point in the analysis. For example, we may have a table with information about the response of patients to a medical intervention and another table with demographic data; at some point these might need to be joined for analysis.
These three qualities are what the end product of the data tidying process must possess.
Messy data is data which is not tidy.
Applying the tidy data criteria standardises the structure of a data set, making exploration and analysis of data easier and less error-prone.
The order in which variables are stored in a data set does not affect the analysis. However, it is good practice to order them in a way that makes the structure of the data set easier to understand. We consider two types of variables: fixed and measured. Fixed variables describe the experimental design and are known prior to data collection. Measured variables are those that are observed in the study. Fixed variables should come first, followed by measured variables, ordered so that related variables are contiguous. Rows are then ordered by the first variable.
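As a small sketch of this ordering convention, with a hypothetical clinical tibble in which Subject and Treatment are fixed variables and HeartRate and BP are measured variables:

```r
library(dplyr)

# hypothetical data: columns deliberately entered in a jumbled order
trial <- tibble::tibble(
  HeartRate = c(72, 81, 64, 90),
  Subject   = c("S2", "S1", "S4", "S3"),
  BP        = c(120, 135, 118, 140),
  Treatment = c("drug", "placebo", "drug", "placebo")
)

trial %>%
  select(Subject, Treatment, HeartRate, BP) %>% # fixed variables first
  arrange(Subject)                              # rows ordered by the first variable
```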
Common problems with messy data sets are:

1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.

The first three features of a messy data set are the most common.
It is important to note that a tidy data set is not necessarily a correct data set. Tidy data refers only to the format of the data. There are, certainly, other problems with data sets, such as errors, unusual observations, duplicated entries, missing data, and many more. But these are dealt with once the data is tidy.
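Once the data is tidy, such clean-up steps typically become one-liners with dplyr; a minimal sketch with a hypothetical tibble:

```r
library(dplyr)

# hypothetical tidy data with a duplicated row and a missing value
df <- tibble::tibble(x = c(1, 2, 2, NA, 50))

df %>%
  distinct() %>%    # drop duplicated rows
  filter(!is.na(x)) # drop rows where x is missing
```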
tidyr, dplyr and stringr functions to tidy data

Let us work with very small data sets to exemplify the mechanics of data tidying.
EXAMPLE. Suppose that we have the following information on the number of text messages that Sue and Peter sent in 2016 and 2017.
| Name | 2016 | 2017 |
|---|---|---|
| Sue | 300 | 500 |
| Peter | 400 | 600 |
This is an illustration of a data set with messy data feature of type 1: column headers are values of a variable (Year), not variable names.
The observational units are Sue and Peter and variables are year and number of text messages sent. A tidy version of this data set should have the columns Name, Year, and number of text messages.
The column names, except for Name, are values of the variable Year. So, to tidy the data, we will “gather” the column names 2016 and 2017 and put them into a column which we will call “Year” and create a new column, say “NumberSMS”.
| Name | Year | NumberSMS |
|---|---|---|
| Sue | 2016 | 300 |
| Peter | 2016 | 400 |
| Sue | 2017 | 500 |
| Peter | 2017 | 600 |
How do we do this with R? There is a function called gather(), in the package tidyr.
First let us create the tibble (enhanced data frame) sms
c1 <- c("Sue", "Peter")
c2 <- c(300, 400)
c3 <- c(500, 600)
sms <- tibble("Name" = c1, "2016" = c2, "2017" = c3)
sms
## # A tibble: 2 x 3
## Name `2016` `2017`
## <chr> <dbl> <dbl>
## 1 Sue 300 500
## 2 Peter 400 600
A tibble, or tbl_df, is the Tidyverse version of the data.frame, keeping the good features of data.frames and discarding those that are not. Tibbles are data.frames that don’t change variable names or types, and that complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code. Tibbles also have an enhanced print() method which makes them easier to use with large data sets containing complex objects. If a tibble is very long or wide, only part of it is printed, unlike a data.frame, which is printed in full.
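A small illustration of a tibble complaining where a data.frame stays silent:

```r
library(tibble)

df <- data.frame(x = 1:3)
tb <- tibble(x = 1:3)

df$y # NULL, silently: the data.frame does not warn about the missing column
tb$y # NULL, with a warning about the unknown column `y`
```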
Now let us tidy the data
# "Year" is the "key": the new column holding the column names being gathered
# "NrSMS" is the "value": the new column holding the values which used
# to be under 2016 and 2017
# the -1 means: do not gather the first column
sms_tidy <- gather(sms, key = "Year", value = "NrSMS", -1)
sms_tidy
## # A tibble: 4 x 3
## Name Year NrSMS
## <chr> <chr> <dbl>
## 1 Sue 2016 300
## 2 Peter 2016 400
## 3 Sue 2017 500
## 4 Peter 2017 600
EXAMPLE. Suppose that we have information on how many text messages and tweets Alice, Ann and John sent in 2016.
| Name | Number | Type |
|---|---|---|
| Alice | 100 | sms |
| Alice | 200 | twt |
| Ann | 300 | sms |
| Ann | 400 | twt |
| John | 500 | sms |
| John | 600 | twt |
This is an illustration of messy data having feature 3: variables are stored in both rows and columns.
The observational units are Alice, Ann and John, and the variables measured are the number of sms sent and the number of tweets made. The column Type is not a measured variable: it stores the names of two variables, sms and twt.
In order to tidy this data set we must “spread” the values of Type into two new columns, “sms” and “twt”, and distribute the values in Number accordingly.
| Name | sms | twt |
|---|---|---|
| Alice | 100 | 200 |
| Ann | 300 | 400 |
| John | 500 | 600 |
To do this in R let us first create the tibble smstwt
c1 <- c("Alice", "Alice", "Ann", "Ann", "John", "John")
c2 <- c(100, 200, 300, 400, 500, 600)
c3 <- c("sms", "twt", "sms","twt", "sms", "twt")
smstwt <- tibble("Name" = c1, "Number" = c2, "Type" = c3)
smstwt
## # A tibble: 6 x 3
## Name Number Type
## <chr> <dbl> <chr>
## 1 Alice 100 sms
## 2 Alice 200 twt
## 3 Ann 300 sms
## 4 Ann 400 twt
## 5 John 500 sms
## 6 John 600 twt
And now, to spread the values of Type into columns,
smstwt_tidy <- spread(smstwt, Type, Number)
smstwt_tidy
## # A tibble: 3 x 3
## Name sms twt
## <chr> <dbl> <dbl>
## 1 Alice 100 200
## 2 Ann 300 400
## 3 John 500 600
EXAMPLE. Now we also have data about the number of sms sent and tweets made by Alice, Ann and John in 2017. The data is presented as follows
| Name | 2016_sms | 2016_twt | 2017_sms | 2017_twt |
|---|---|---|---|---|
| Alice | 100 | 200 | 10 | 20 |
| Ann | 300 | 400 | 30 | 40 |
| John | 500 | 600 | 50 | 60 |
Although the data without the last two columns was tidy, this data is messy.
Observational units are Alice, Ann and John and measured variables are year, number of tweets and number of sms.
This data set is an illustration of features 2 and 3 of a messy data set.
Let us create a tibble, smstwt2, with the data.
c1 <- c("Alice", "Ann", "John")
c2 <- c(100, 300, 500)
c3 <- c(200, 400, 600)
c4 <- c(10, 30, 50)
c5 <- c(20, 40, 60)
smstwt2 <- tibble("Name" = c1, "2016_sms" = c2, "2016_twt" = c3, "2017_sms" = c4, "2017_twt" = c5)
smstwt2
## # A tibble: 3 x 5
## Name `2016_sms` `2016_twt` `2017_sms` `2017_twt`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alice 100 200 10 20
## 2 Ann 300 400 30 40
## 3 John 500 600 50 60
First of all, the names of the columns, except Name, are values of variables. So, the first thing we need to do is gather all column names into one new column named “year_type” and put the corresponding values in the new column “Number”.
smstwt2_g <- gather(smstwt2, "year_type", "Number",-1)
smstwt2_g
## # A tibble: 12 x 3
## Name year_type Number
## <chr> <chr> <dbl>
## 1 Alice 2016_sms 100
## 2 Ann 2016_sms 300
## 3 John 2016_sms 500
## 4 Alice 2016_twt 200
## 5 Ann 2016_twt 400
## 6 John 2016_twt 600
## 7 Alice 2017_sms 10
## 8 Ann 2017_sms 30
## 9 John 2017_sms 50
## 10 Alice 2017_twt 20
## 11 Ann 2017_twt 40
## 12 John 2017_twt 60
This data set is still messy because the column year_type has feature 2 of a messy data set: multiple variables are stored in one column.
We must separate column year_type into two new columns: “year” and “type”. The function separate() from tidyr does the job.
smstwt2_gs <- separate(smstwt2_g, "year_type", c("year","type"), sep = "_")
smstwt2_gs
## # A tibble: 12 x 4
## Name year type Number
## <chr> <chr> <chr> <dbl>
## 1 Alice 2016 sms 100
## 2 Ann 2016 sms 300
## 3 John 2016 sms 500
## 4 Alice 2016 twt 200
## 5 Ann 2016 twt 400
## 6 John 2016 twt 600
## 7 Alice 2017 sms 10
## 8 Ann 2017 sms 30
## 9 John 2017 sms 50
## 10 Alice 2017 twt 20
## 11 Ann 2017 twt 40
## 12 John 2017 twt 60
The resulting data set has feature 3 of a messy data set: variables are stored in both rows and columns.
To continue tidying we must “spread” the values of type into two new columns, “sms” and “twt”, with the corresponding values of Number.
smstwt2_tidy <- spread(smstwt2_gs, type, Number)
smstwt2_tidy
## # A tibble: 6 x 4
## Name year sms twt
## <chr> <chr> <dbl> <dbl>
## 1 Alice 2016 100 200
## 2 Alice 2017 10 20
## 3 Ann 2016 300 400
## 4 Ann 2017 30 40
## 5 John 2016 500 600
## 6 John 2017 50 60
We can use %>% to do the entire tidying in one go, without creating the intermediate objects.
smstwt_tidy <- smstwt2 %>%
gather("year_type", "Number",-1) %>%
separate("year_type", c("year", "type"), sep = "_") %>%
mutate(year = as.numeric(year)) %>%
spread(type, Number)
smstwt_tidy
## # A tibble: 6 x 4
## Name year sms twt
## <chr> <dbl> <dbl> <dbl>
## 1 Alice 2016 100 200
## 2 Alice 2017 10 20
## 3 Ann 2016 300 400
## 4 Ann 2017 30 40
## 5 John 2016 500 600
## 6 John 2017 50 60
EXAMPLE. We received information about the ages of Alice, Ann, John, Peter and Sue.
| Name | Age |
|---|---|
| Alice | 15 |
| Ann | 25 |
| John | 35 |
| Peter | 45 |
| Sue | 55 |
Now, we have feature 5 of messy data sets: a single observational unit is stored in multiple tables.
Let us enter the new information into R.
dem1 <- c("Alice", "Ann", "John", "Peter", "Sue")
dem2 <- c(15, 25, 35, 45, 55)
dem <- tibble("Name" = dem1, "Age" = dem2)
dem
## # A tibble: 5 x 2
## Name Age
## <chr> <dbl>
## 1 Alice 15
## 2 Ann 25
## 3 John 35
## 4 Peter 45
## 5 Sue 55
We would like to add this information to our data sets sms_tidy and smstwt2_tidy.
The function inner_join() of the dplyr package joins data sets by matching the entries of all common columns, or of a specific column present in both data sets.
inner_join(sms_tidy,dem)
## Joining, by = "Name"
## # A tibble: 4 x 4
## Name Year NrSMS Age
## <chr> <chr> <dbl> <dbl>
## 1 Sue 2016 300 55
## 2 Peter 2016 400 45
## 3 Sue 2017 500 55
## 4 Peter 2017 600 45
inner_join(smstwt2_tidy,dem)
## Joining, by = "Name"
## # A tibble: 6 x 5
## Name year sms twt Age
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Alice 2016 100 200 15
## 2 Alice 2017 10 20 15
## 3 Ann 2016 300 400 25
## 4 Ann 2017 30 40 25
## 5 John 2016 500 600 35
## 6 John 2017 50 60 35
Observe that in the last data set age is added twice for each observational unit. This in itself can be considered a messy feature, namely feature 4: multiple types in one table. We may need to keep, for example, demographic information about observational units in a separate data set, or we may need to merge both data sets into one for the purpose of analysis.
It is important to remark that messy data is not necessarily incorrect data. The messy aspect of data sets sometimes arises from confusion between the concepts of data for presentation and data for analysis. Elegantly presented data may constitute a very messy data set, and a tidy data set is usually unsuitable for presentation. However, many times messy data is just badly organised data.
Below is the number of tweets that Sue and Peter, who we met in the first example, made in 2016 and 2017.
| Name | 2016 | 2017 |
|---|---|---|
| Sue | 30 | 50 |
| Peter | 40 | 60 |
Produce a tibble which is in a tidy format containing the information about the number of text messages Sue and Peter sent and the number of tweets they made in 2016 and 2017.
We will take a hands-on approach to illustrate the data tidying process: with seemingly well-presented data which is (very) messy, with plainly badly organised and presented data, and with data from different sources and in different formats. We will continue to learn how to tidy and prepare a data set for analysis in R. This will be done via case studies, using data in a variety of formats, each with unique challenges.
We consider graduate admission figures for the autumn of 1973 at the University of California, Berkeley. The numbers, shown below, seem to imply that men applying for post-graduate studies were more likely than women to be admitted. It was argued that the difference was so large that it was unlikely to be due to chance. The data set is usually presented as follows:
| Applicants | Admitted | |
|---|---|---|
| Men | 8442 | 44% |
| Women | 4321 | 35% |
There is nothing wrong with the data in terms of how it is presented. The presentation is informative. However, for analysis purposes, this is an example of “messy data”.
In this example an observational unit is an applicant and the measured variables are gender (male, female) and admission status (admitted, not admitted). Although we can certainly infer the information from the table above, the variables are not explicit, which can complicate analysis and introduce error.
Since we are not interested in any other characteristics of the observational units we may aggregate cases by Gender and Admission Status.
We next present the same information in a different way.
| Gender | Admission_status | Number |
|---|---|---|
| male | admitted | 3714 |
| male | rejected | 4728 |
| female | admitted | 1512 |
| female | rejected | 2809 |
This is the tidy version of the messy data set above. Using the original data would require different strategies to obtain the variables of interest. This would slow down the analysis process and make it more error-prone. The tidy version spells out the variables in a clear and standard way, ready for analysis.
The data above has been aggregated across departments. However, it is known that the genders differ in which departments they prefer to apply to (women have a preference for e.g. psychology or English studies, men have a preference for e.g. engineering studies).
When examining the individual departments, it appeared that six out of 85 departments were significantly biased against men, whereas only four were significantly biased against women. The data from the six largest departments are listed below.
Is this a tidy data set? If not, how can it be transformed into a tidy data set?
| Department | Applied_men | Admitted_men | Applied_women | Admitted_women |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 373 | 6% | 341 | 7% |
Let us create a data frame in R with the information above. The file “BerkeleyData.txt” contains the data.
Berk_messy2 <- read_table2("DataFiles/BerkeleyData.txt")
## Parsed with column specification:
## cols(
## Department = col_character(),
## Applied_men = col_integer(),
## Admitted_men_p = col_character(),
## Applied_women = col_integer(),
## Admitted_women_p = col_character()
## )
Berk_messy2
## # A tibble: 6 x 5
## Department Applied_men Admitted_men_p Applied_women Admitted_women_p
## <chr> <int> <chr> <int> <chr>
## 1 A 825 62% 108 82%
## 2 B 560 63% 25 68%
## 3 C 325 37% 593 34%
## 4 D 417 33% 375 35%
## 5 E 191 28% 393 24%
## 6 F 373 6% 341 7%
The data in Berk_messy2 does not comply with the principles of tidy data. The variables in this study are Department (fixed), Gender (fixed), Number of applicants (measured) and Admission Status (measured). So, there is some manipulation to do before we can obtain a data array with four columns.
Firstly, note that we need to transform the information given in percentages into numbers of admitted and non-admitted applicants.
glimpse(Berk_messy2)
## Observations: 6
## Variables: 5
## $ Department <chr> "A", "B", "C", "D", "E", "F"
## $ Applied_men <int> 825, 560, 325, 417, 191, 373
## $ Admitted_men_p <chr> "62%", "63%", "37%", "33%", "28%", "6%"
## $ Applied_women <int> 108, 25, 593, 375, 393, 341
## $ Admitted_women_p <chr> "82%", "68%", "34%", "35%", "24%", "7%"
The symbol “%” needs to be eliminated in order to be able to compute the number of admitted and non-admitted applicants. We use the function str_replace() of the package stringr, and the mutate_at() function to apply it to columns 3 and 5 in one call. If a character appears more than once in each entry of a character string and you want to replace all instances, use str_replace_all(); str_replace() only replaces the first instance.
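A small illustration of the difference between the two:

```r
library(stringr)

str_replace("1,000,000", ",", "")     # "1000,000": only the first comma is replaced
str_replace_all("1,000,000", ",", "") # "1000000": every comma is replaced
```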
#remove percentage sign from columns 3 and 5
Berk_messy2 <- mutate_at(Berk_messy2, c("Admitted_men_p", "Admitted_women_p"), str_replace, "%", "")
#make columns 3 and 5 numeric
Berk_messy2 <- mutate_at(Berk_messy2, c("Admitted_men_p", "Admitted_women_p"), as.numeric)
Berk_messy2
## # A tibble: 6 x 5
## Department Applied_men Admitted_men_p Applied_women Admitted_women_p
## <chr> <int> <dbl> <int> <dbl>
## 1 A 825 62 108 82
## 2 B 560 63 25 68
## 3 C 325 37 593 34
## 4 D 417 33 375 35
## 5 E 191 28 393 24
## 6 F 373 6 341 7
glimpse(Berk_messy2)
## Observations: 6
## Variables: 5
## $ Department <chr> "A", "B", "C", "D", "E", "F"
## $ Applied_men <int> 825, 560, 325, 417, 191, 373
## $ Admitted_men_p <dbl> 62, 63, 37, 33, 28, 6
## $ Applied_women <int> 108, 25, 593, 375, 393, 341
## $ Admitted_women_p <dbl> 82, 68, 34, 35, 24, 7
summary(Berk_messy2)
## Department Applied_men Admitted_men_p Applied_women
## Length:6 Min. :191.0 Min. : 6.00 Min. : 25.0
## Class :character 1st Qu.:337.0 1st Qu.:29.25 1st Qu.:166.2
## Mode :character Median :395.0 Median :35.00 Median :358.0
## Mean :448.5 Mean :38.17 Mean :305.8
## 3rd Qu.:524.2 3rd Qu.:55.75 3rd Qu.:388.5
## Max. :825.0 Max. :63.00 Max. :593.0
## Admitted_women_p
## Min. : 7.00
## 1st Qu.:26.50
## Median :34.50
## Mean :41.67
## 3rd Qu.:59.75
## Max. :82.00
Now we need to convert the percentages into admitted and not-admitted numbers. We want whole numbers (integers), so we use the function round() with the argument digits = 0, together with mutate(), which creates or modifies a column.
Berk_messy2 <- mutate(Berk_messy2, Admitted_M = round(Admitted_men_p * Applied_men/100, digits=0))
Berk_messy2 <- mutate(Berk_messy2, Admitted_F = round(Admitted_women_p * Applied_women/100, digits=0))
Berk_messy2 <- mutate(Berk_messy2, NotAdmitted_M = Applied_men - Admitted_M)
Berk_messy2 <- mutate(Berk_messy2, NotAdmitted_F = Applied_women - Admitted_F)
#
glimpse(Berk_messy2)
## Observations: 6
## Variables: 9
## $ Department <chr> "A", "B", "C", "D", "E", "F"
## $ Applied_men <int> 825, 560, 325, 417, 191, 373
## $ Admitted_men_p <dbl> 62, 63, 37, 33, 28, 6
## $ Applied_women <int> 108, 25, 593, 375, 393, 341
## $ Admitted_women_p <dbl> 82, 68, 34, 35, 24, 7
## $ Admitted_M <dbl> 512, 353, 120, 138, 53, 22
## $ Admitted_F <dbl> 89, 17, 202, 131, 94, 24
## $ NotAdmitted_M <dbl> 313, 207, 205, 279, 138, 351
## $ NotAdmitted_F <dbl> 19, 8, 391, 244, 299, 317
# remove the percentages columns
Berk_messy3 <- select(Berk_messy2, -c(3,5))
glimpse(Berk_messy3)
## Observations: 6
## Variables: 7
## $ Department <chr> "A", "B", "C", "D", "E", "F"
## $ Applied_men <int> 825, 560, 325, 417, 191, 373
## $ Applied_women <int> 108, 25, 593, 375, 393, 341
## $ Admitted_M <dbl> 512, 353, 120, 138, 53, 22
## $ Admitted_F <dbl> 89, 17, 202, 131, 94, 24
## $ NotAdmitted_M <dbl> 313, 207, 205, 279, 138, 351
## $ NotAdmitted_F <dbl> 19, 8, 391, 244, 299, 317
# remove the total number of applications columns
Berk_messy3 <- select(Berk_messy3, -c(2,3))
glimpse(Berk_messy3)
## Observations: 6
## Variables: 5
## $ Department <chr> "A", "B", "C", "D", "E", "F"
## $ Admitted_M <dbl> 512, 353, 120, 138, 53, 22
## $ Admitted_F <dbl> 89, 17, 202, 131, 94, 24
## $ NotAdmitted_M <dbl> 313, 207, 205, 279, 138, 351
## $ NotAdmitted_F <dbl> 19, 8, 391, 244, 299, 317
We have managed to tidy the data a bit, but we are not there yet. Recall that the variables in this study are gender, department, admission status and the numbers observed. We would like the data to be stored in four columns. As it is, the data combines two variables, gender and admission status, in single columns.
Let us use the package tidyr to manipulate the data further.
The column names Admitted_F, Admitted_M, NotAdmitted_F, NotAdmitted_M are not variable names but values of two variables: admission status and gender. So, first of all, we will create a single column named “admitted_status_gender”, which will take on the values Admitted_F, Admitted_M, NotAdmitted_F, NotAdmitted_M, and a new column, “number”, with the corresponding values for each category. We leave the first column, Department, where it is (the -1 argument in the gather() call below indicates that the first column should be left out of the “gathering” process).
Berk_messy4 <- gather(Berk_messy3, "admitted_status_gender", "number", -1)
Berk_messy4
## # A tibble: 24 x 3
## Department admitted_status_gender number
## <chr> <chr> <dbl>
## 1 A Admitted_M 512
## 2 B Admitted_M 353
## 3 C Admitted_M 120
## 4 D Admitted_M 138
## 5 E Admitted_M 53
## 6 F Admitted_M 22
## 7 A Admitted_F 89
## 8 B Admitted_F 17
## 9 C Admitted_F 202
## 10 D Admitted_F 131
## # ... with 14 more rows
Note that the admitted_status_gender column actually combines two variables: admission status and gender. So we must separate this column into two, namely “AdmissionStatus” and “Gender”. We use the separate() function of the tidyr package.
Berk_tidy2 <- separate(Berk_messy4, admitted_status_gender, c("AdmissionStatus","Gender"),
sep = "_")
Berk_tidy2
## # A tibble: 24 x 4
## Department AdmissionStatus Gender number
## <chr> <chr> <chr> <dbl>
## 1 A Admitted M 512
## 2 B Admitted M 353
## 3 C Admitted M 120
## 4 D Admitted M 138
## 5 E Admitted M 53
## 6 F Admitted M 22
## 7 A Admitted F 89
## 8 B Admitted F 17
## 9 C Admitted F 202
## 10 D Admitted F 131
## # ... with 14 more rows
Finally, we re-order columns and sort rows by the first column, Department, using the arrange() function.
# rearrange columns, fixed variables first
Berk_tidy2 <- select(Berk_tidy2, c(1, 3, 2, 4))
# order rows by Department
Berk_tidy2 <- arrange(Berk_tidy2, Department)
Berk_tidy2
## # A tibble: 24 x 4
## Department Gender AdmissionStatus number
## <chr> <chr> <chr> <dbl>
## 1 A M Admitted 512
## 2 A F Admitted 89
## 3 A M NotAdmitted 313
## 4 A F NotAdmitted 19
## 5 B M Admitted 353
## 6 B F Admitted 17
## 7 B M NotAdmitted 207
## 8 B F NotAdmitted 8
## 9 C M Admitted 120
## 10 C F Admitted 202
## # ... with 14 more rows
The data is now tidy and ready for exploratory analysis.
A summary of the tidying process is given below.
Berk_tidy <- read_table2("DataFiles/BerkeleyData.txt") %>%
mutate_at(c("Admitted_men_p", "Admitted_women_p"), str_replace, "%", "") %>%
mutate_at(c("Admitted_men_p", "Admitted_women_p"), as.numeric) %>%
mutate(Admitted_M = round(Admitted_men_p * Applied_men/100)) %>%
mutate(Admitted_F = round(Admitted_women_p * Applied_women/100)) %>%
mutate(NotAdmitted_M = Applied_men - Admitted_M) %>%
mutate(NotAdmitted_F = Applied_women - Admitted_F) %>%
select(-c(2:5)) %>%
gather("admitted_status_gender", "number", -1) %>%
separate(admitted_status_gender, c("AdmissionStatus","Gender"), sep = "_") %>%
select(c(1,3,2,4)) %>%
arrange(Department)
If we want to keep a copy of the clean data in a file:
write_csv(Berk_tidy2, "DataFiles/BerkeleyDataTidy.csv")
#read in the file we just saved
aux <- read_csv("DataFiles/BerkeleyDataTidy.csv")
## Parsed with column specification:
## cols(
## Department = col_character(),
## Gender = col_character(),
## AdmissionStatus = col_character(),
## number = col_integer()
## )
#check the data was saved in the correct format
aux
## # A tibble: 24 x 4
## Department Gender AdmissionStatus number
## <chr> <chr> <chr> <int>
## 1 A M Admitted 512
## 2 A F Admitted 89
## 3 A M NotAdmitted 313
## 4 A F NotAdmitted 19
## 5 B M Admitted 353
## 6 B F Admitted 17
## 7 B M NotAdmitted 207
## 8 B F NotAdmitted 8
## 9 C M Admitted 120
## 10 C F Admitted 202
## # ... with 14 more rows
We visualise this data set of categorical, non-ordinal data using a mosaic plot. Mosaic plots are useful for visualising proportions in more than two dimensions. We use the package vcd.
First, we display the data as a contingency table
# create a contingency table
ct1 <- xtabs(number ~ Gender + AdmissionStatus, data = Berk_tidy2)
ct1
## AdmissionStatus
## Gender Admitted NotAdmitted
## F 557 1278
## M 1198 1493
Now we display it as a mosaic plot
mosaic(ct1)
In this mosaic plot for the entire university data we can see that many more males are accepted to graduate school than females. Next we visualise the data in a similar way but by department or academic unit.
# create a contingency table
aux2 <- xtabs(number ~ Department + AdmissionStatus + Gender, data = Berk_tidy2)
mosaic(aux2)
The heights and widths of the tiles in the mosaic are proportional to the proportions in the margins. So a very flat rectangle indicates, proportionally, very few applicants of the corresponding gender. A long rectangle in the admitted status indicates that, proportionally, it is not so difficult to be accepted in the corresponding department. As we can see, there is no evidence for a discrimination case. In Departments A and B applicants are mainly male, but in Department A, proportionally, more female than male applicants were admitted. Department F is very competitive and has a high rejection rate, which applies nearly equally to both female and male applicants.
This mosaic also shows the explanation: selective departments have more female applicants. This is easy to see since the departments are ordered by selectiveness. Departments A and B admit many applicants, but those applicants are mostly male. The reverse is true for the rest. This means that the overall female applicant pool takes big admittance hits in departments C through F, while many males get in via departments A and B.
One of the perils when studying associations between a variable of interest and a set of explanatory variables is overfitting. If we use too many explanatory variables we may explain the observed values of the variable of interest very well, but nothing else, and so our study will have little predictive value.
Problems also occur when relevant explanatory variables are ignored. It is possible to observe one effect when a relevant variable is ignored and the opposite effect when that variable is taken into account. This is called Simpson’s paradox. What we have explored with the Berkeley graduate admissions data is one of the best-known examples of Simpson’s paradox.
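We can see the paradox directly in the numbers, assuming the tidy tibble Berk_tidy2 from above is in the workspace: overall, men are admitted at a higher rate, yet within most departments women do as well or better.

```r
library(dplyr)

# overall admission rate by gender: men appear favoured
Berk_tidy2 %>%
  group_by(Gender) %>%
  summarise(rate = sum(number[AdmissionStatus == "Admitted"]) / sum(number))

# admission rate by gender within each department:
# the apparent advantage largely disappears or reverses
Berk_tidy2 %>%
  group_by(Department, Gender) %>%
  summarise(rate = sum(number[AdmissionStatus == "Admitted"]) / sum(number))
```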
Take the original Berkeley admission data (without departments) and enter it in R. Then manipulate it using R functions until you arrive at its tidy version.